External information, such as prior information or expert opinions, can
play an important role in the design, analysis and interpretation of
clinical trials. However, little attention has been devoted thus far to
incorporating external information in clinical trials with binary
outcomes, perhaps due to the perception that binary outcomes can be
treated as normally-distributed outcomes by using normal approximations.
In this paper we show that these two types of clinical trials could
behave differently, and that special care is needed for the analysis of
clinical trials with binary outcomes. In particular, we first examine a
simple but commonly used univariate Bayesian approach and observe a
technical flaw. We then study the full Bayesian approach using different
beta priors and a new frequentist approach based on the notion of
confidence distribution (CD). These approaches are illustrated and
compared using data from clinical studies and simulations. The full
Bayesian approach is theoretically sound, but surprisingly, under skewed
prior distributions, the estimate derived from the marginal posterior
distribution may not fall between those from the marginal prior and the
likelihood of clinical trial data. This counterintuitive phenomenon,
which we call the "discrepant posterior phenomenon," does not occur in
the CD approach. The CD approach is also computationally simpler and can
be applied directly to any prior distribution, symmetric or skewed.

1. Introduction. In pharmaceutical fields as well as many others, there
is great interest in conducting randomized trials with designs that can en- 
able combining external information, such as prior information or expert 
opinions, with trial data to enhance the interpretation of the findings. In 
early landmark works, Spiegelhalter, Freedman and Parmar (1994) and Parmar,
Spiegelhalter and Freedman (1994) provided an interesting illustration
of integrating expert opinions with data from cancer trials using a
Bayesian framework. Parmar, Spiegelhalter and Freedman (1994) noted that
the added flexibility for such trials to stop for efficacy, futility or safety can
greatly increase the efficiency of clinical research. Designs for incorporating 
external information are also useful in drug development when a pilot study, 
also known as a hypothesis generating study, is conducted with a sample size 
that may be inadequate for detecting clinically meaningful treatment effects. 
In this case, relevant information including trial results and expert opinions 
can be used to help decision makers with whether to proceed with a larger 
confirmatory study and, if so, how to design it. 

Although the applications have drawn increasing interest in recent years, 
little attention has been devoted to the special yet commonly seen clinical 
trials with binary outcomes, partly due to an inaccurate common belief 
that little is new regarding the trials with binary outcomes. In this paper, 
motivated by a case study of a clinical trial with binary outcomes in a 
migraine therapy, we develop and compare statistical methods which can 
effectively combine information from clinical trials of binary outcomes with 
information from surveys of expert opinions. The results show that clinical 
trials with binary outcomes can behave quite differently from trials with
normally distributed outcomes. Thus, special care
is warranted for such trials. 

The Bayesian paradigm has played a dominant role in combining ex- 
pert opinions with clinical trial data. Almost all methods currently used are 
Bayesian; see, for example, Berry and Stangl (1996), Spiegelhalter, Abrams 
and Myles (2004), Carlin and Louis (2009) and Wijeysundera et al. (2009). 
In the Bayesian paradigm, as illustrated in Spiegelhalter, Freedman and 
Parmar (1994), a prior distribution is first formed to express the initial 
beliefs concerning the parameter of interest based on either some objec- 
tive evidence or some subjective judgment or a combination of the two. 
Subsequently, clinical trial evidence is summarized by a likelihood func- 
tion, and a posterior distribution is then formed by combining the prior 
distribution with the likelihood function. Although this general Bayesian 
paradigm also applies to the special case of clinical trials with binary out- 
comes, the simple univariate Bayesian approach developed in Spiegelhalter, 
Freedman and Parmar (1994) for clinical trials with normally distributed 
outcomes cannot be applied directly to clinical trials with binary outcomes. 
The latter point is contrary to the common belief, as we elaborate below.
Univariate Bayesian approach.

Consider a clinical trial of binary outcomes with a treatment group and
a control group. Denote by $X_{1i} \sim \text{Bernoulli}(p_1)$, for
$i = 1,\ldots,n_1$, the responses from the treatment group and by
$X_{0i} \sim \text{Bernoulli}(p_0)$, for $i = 1,\ldots,n_0$, the responses
from the control group. Assume that the parameter of interest is the
difference of the success rates between the two treatments,
$\delta = p_1 - p_0$, and its prior distribution $\pi(\delta)$ is formed
based on some objective evidence and/or some subjective judgment. Let
$\hat\delta = \bar X_1 - \bar X_0$ and
$\hat C_d^2 = \bar X_1(1 - \bar X_1)/n_1 + \bar X_0(1 - \bar X_0)/n_0$,
where $\bar X_1 = \sum_{i=1}^{n_1} X_{1i}/n_1$ and
$\bar X_0 = \sum_{i=1}^{n_0} X_{0i}/n_0$. Note that $\hat C_d^2$ is an
estimator of $C_d^2 = \mathrm{var}(\hat\delta) = p_1(1 - p_1)/n_1 + p_0(1 - p_0)/n_0$.
A popular univariate Bayesian approach, as seen in Spiegelhalter, Freedman
and Parmar [(1994), pages 360-361], would then treat

(1.1) $\hat\ell(\delta|\hat\delta) = \exp\{-\tfrac{1}{2}\log(2\pi \hat C_d^2) - \tfrac{1}{2}(\delta - \hat\delta)^2/\hat C_d^2\}$

as the "likelihood function" of $\delta$, and proceed to compute the posterior
distribution of $\delta$, $\pi(\delta|\text{data})$, according to

$\pi(\delta|\text{data}) \propto \pi(\delta)\,\hat\ell(\delta|\hat\delta)$.

When the prior $\pi(\cdot)$ is modeled as a normal distribution, the approach
involves an explicit posterior and is straightforward. Although this
univariate Bayesian approach has been used in practice, it has in fact a
theoretical flaw. Strictly speaking, (1.1) is not a likelihood function of
$\delta$, even in the context of estimated likelihood [see, e.g., Boos and
Monahan (1986)]. In particular, in the clinical trial mentioned above, a
conditional density function $f(\text{data}|\delta)$ solely depending on a
single parameter $\delta = p_1 - p_0$ does not exist, and, thus, it is not
possible to find a "marginal likelihood" of $\delta$. Therefore,
$\pi(\delta|\text{data}) \propto \pi(\delta) f(\text{data}|\delta)$ is not well
defined and any univariate Bayesian approach focusing directly on $\delta$ is
not supported by the Bayesian theory.
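
To make the mechanics of the univariate approach concrete, the following minimal sketch computes the explicit posterior obtained when a normal prior is multiplied by the normal-approximation "likelihood" (1.1). It only illustrates the calculation that the text criticizes; it is not an endorsement of the approach. The function name, the success counts and the prior mean and standard deviation are hypothetical values chosen for illustration; only the group sizes 59 and 68 come from the case study.

```python
import math


def univariate_normal_posterior(x1, n1, x0, n0, prior_mean, prior_sd):
    """Normal prior times the normal-approximation 'likelihood' (1.1).

    x1, x0 are success counts; n1, n0 are group sizes.
    Returns the posterior mean and standard deviation of delta.
    """
    xbar1, xbar0 = x1 / n1, x0 / n0
    delta_hat = xbar1 - xbar0
    # Plug-in variance estimate of delta_hat (C_d^2 hat in the text).
    c2_hat = xbar1 * (1 - xbar1) / n1 + xbar0 * (1 - xbar0) / n0
    # Standard normal-normal update: precisions add, means are precision-weighted.
    prior_prec = 1.0 / prior_sd ** 2
    data_prec = 1.0 / c2_hat
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * delta_hat)
    return post_mean, math.sqrt(post_var)


# Hypothetical inputs (success counts and prior are assumptions, not trial data).
prior_mean, prior_sd = 0.05, 0.10
delta_hat = 30 / 59 - 28 / 68
post_mean, post_sd = univariate_normal_posterior(
    x1=30, n1=59, x0=28, n0=68, prior_mean=prior_mean, prior_sd=prior_sd
)
print(f"delta_hat={delta_hat:.4f}, posterior mean={post_mean:.4f}, sd={post_sd:.4f}")
```

Under a normal prior the posterior mean is a precision-weighted average, so in this sketch it always lies between the prior mean and $\hat\delta$.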

The point above is alluded to in the argument made by Efron (1986) and 
Wasserman (2007) that a Bayesian approach is not good for "division of la- 
bor" in the sense that "statistical problems need to be solved as one coherent 
whole in a Bayesian approach," including "assigning priors and conducting 
analyses with nuisance parameters." This observation suggests that a sound 
Bayesian solution in the current context is a full Bayesian model that can 
jointly model $p_0$ and $p_1$ or their reparametrizations. Joseph, du Berger and
Belisle (1997) presented such a full Bayesian approach using (mostly) a set
of independent beta priors for $p_0$ and $p_1$. However, the paper focused mainly
on the utility of the approach in sample size determination rather than on its 
general performance in the context of clinical trials with binary outcomes. 
In the present paper, in addition to the independent beta priors, we broaden 
the scope of the full Bayesian approach to include three more flexible priors, 
namely, independent hierarchical beta priors, dependent bivariate beta
(BIBETA) priors [Olkin and Liu (2003)] and dependent hierarchical bivariate
beta priors. 

We also develop a Markov Chain Monte Carlo (MCMC) algorithm for im- 
plementing these full Bayesian approaches, since most resulting posteriors 
do not assume explicit forms. The full Bayesian approaches are theoretically 
sound, and intuitively would have been expected to provide a systematic so- 
lution to the problems in our case study. However, a close examination of the 
situation with skewed priors reveals a surprising phenomenon in which the 
estimate derived from the posterior distribution may not be between those 
from the prior distribution and the likelihood function of the observed data 
(details in Section 4.2). We shall refer to this phenomenon as the "discrepant 
posterior phenomenon." To the best of our knowledge, this discrepant poste- 
rior phenomenon has not been reported elsewhere. This observation indicates 
that clinical trials of binary outcomes can behave differently from the normal 
clinical trials studied in Spiegelhalter, Freedman and Parmar (1994). It also
shows an inherent difficulty in the modeling of trials with binary outcomes,
especially if $p_0$ and $p_1$ are potentially correlated. This discrepant posterior
phenomenon manifests itself in settings beyond binary outcomes, and it has 
far reaching implications in Bayesian applications in general, as we discuss 
in Section 5. 

In addition to studying the full Bayesian approach, we also propose a 
new frequentist approach for combining external information with clinical 
trial data. Efron (1986) and Wasserman (2007) argued that a frequentist ap- 
proach has "the edge of division of labor" over a Bayesian approach. They 
illustrated this point by using the example of population quantile, which can 
be directly estimated in a frequentist setting by its corresponding sample 
quantile without any modeling effort or involving other (nuisance) parameters.
In our context, this indicates that we can use a univariate frequentist
approach to model directly the parameter of interest $\delta$, without having to
model jointly the treatment effects $(p_0,p_1)$. On the other hand, it is clear
that a standard frequentist approach is not equipped to deal with external 
information such as expert opinions, which are not actual observed data 
from the clinical trials. To overcome this difficulty, we take advantage of the 
confidence distribution (CD), which uses a sample-dependent distribution 
function to estimate a parameter of interest [see, e.g., Schweder and Hjort 
(2002) and Singh, Xie and Strawderman (2005)]. In particular, we use a CD 
to summarize external information or expert opinions, and then combine it 
with the estimates from the clinical trial. This alternative scheme can be 
viewed as a compromise between the Bayesian and frequentist paradigms. It 
is a frequentist approach, since the parameter is treated as a fixed value and 
not a random entity. It nonetheless also has a Bayesian flavor, since the prior 
expert opinions represent only the relative experience or prior knowledge of 
the experts but not any actual observed data. The CD approach is easy to
implement and can be a useful data analysis tool for the type of studies 
considered in the present paper. 

The main emphasis of the paper is on the study and comparison of the 
methods for incorporating expert opinions with clinical trial data in the 
binary outcome setting, and not the methods for pooling together individual 
expert opinions. The latter have been discussed extensively by Genest and 
Zidek (1986). The goal of this research is to raise awareness of the complexity 
of the practice of incorporating external information. Although it draws 
attention to a difference between Bayesian and non-Bayesian approaches in 
practice, it is not meant to either promote or criticize any of the Bayesian 
or frequentist approaches. 

The rest of this section describes a pilot clinical study in a migraine 
therapy by Johnson and Johnson, Inc. In Section 2 we develop full Bayesian 
approaches with four different priors and implement the approaches through 
an MCMC algorithm. In Section 3 we present the alternative approach of 
frequentist Bayes compromise using CDs. In Section 4 we illustrate the ap- 
proaches discussed in Sections 2 and 3 using the data presented in Sec- 
tion 1.1. We also conduct a simulation study to compare the performance 
of these approaches in situations where the prior distributions are skewed. 
Finally, we provide in Section 5 some concluding remarks and discussions. 

1.1. Application: The pilot study on migraine therapy, background and 
data. Our data are collected from a recent clinical study on patients with 
migraine headaches (see footnote 3). The objective was to determine the potential impact
of a preventive migraine therapy, topiramate, on the therapeutic efficacy of 
the acute migraine therapy, almotriptan. 

The study consisted of a 6-week open-label phase followed by a random- 
ized double-blind phase that lasted 20 weeks. Patients received topiramate 
during the open-label run-in period that enabled the selection for random- 
ization of patients who could tolerate a dosing regimen of 100 mg/day and 
who met the eligibility criteria based on migraine rate. Those found eligi- 
ble were randomly assigned to receive topiramate (Treatment A) or placebo 
(Treatment B/Control). Throughout the study, almotriptan 12.5 mg was 
used as an acute treatment for symptomatic relief of migraine headaches. 
The patients recorded assessments of migraine activity, associated symp- 
toms and other relevant details into an electronic daily diary (Personal Dig- 
ital Assistant [PDA]). The numbers of patients in the treatment and the 
control groups are $n_1 = 59$ and $n_0 = 68$, respectively. The slight difference



3 Clinical trial NCT00210496 by Janssen-Ortho LLC (Johnson & Johnson, Inc.) Web
link: http://clinicaltrials.gov/ct2/results?term=NCT00210496.
Table 1
Opinion survey of 11 experts on the treatment improvement $\delta$

Expert opinions for achievement of pain relief at 2 hours (PR2)

Worse (%) Better (%)

Investigator 20+ 20~17 16~13 12~9 8~5 4~0 0~4 5~8 9~12 13~16 17~20 20+

1 5 

2 

3 

4 2 

5 

6 

7 

8 

9 
10 
11 

Group mean 0.64 



in the group size reflects the dropout of a handful of patients during the 
double-blind phase. The most common reason for these dropouts was subject
choice/withdrawal of consent. Few patients discontinued treatment due
to limiting adverse events during the double-blind phase.

The trial objectives and study design were presented at an investigator 
meeting prior to the start of the study. At the meeting, following the design 
of Parmar, Spiegelhalter and Freedman (1994), the study sponsor sought 
the individual opinions of each investigator expert regarding the expected 
improvement of Treatment A over Treatment B for a series of clinical out- 
comes. For illustration we focus on a specific outcome, pain relief at two 
hours (PR2) after dosing with almotriptan, one of 16 outcomes investigated 
in the trial. During the meeting, each expert was asked to use the 12 intervals 
shown in Table 1 to assign a "weight of belief," based on his/her experience, 
in the difference in the percentage of patients expected to achieve PR2 in 
the two treatment groups. In other words, each expert was given 100 "vir- 
tual patients" to be assigned to one of the 12 possible intervals of difference 
between the two treatments (from -20% to 20%) in Table 1. Table 1 shows
the belief distributions for each of the 11 experts and the group mean. The 
histogram in Figure 1 shows the group means of the 11 experts' beliefs of 
the improvement of Treatment A over Treatment B. 

The histogram in Figure 1, derived from the arithmetic means in the last 
row of Table 1, is to be used as a (marginal) prior in our Bayesian analysis for 
the improvement of Treatment A over Treatment B. This practice effectively 
assumes that the heterogeneity of expert opinions could be averaged out by 
Fig. 1. Distribution of experts' beliefs (group mean) on the difference in success rates
between topiramate and control in achieving pain relief at 2 hours in migraine headaches. 

arithmetic means [cf. Genest and Zidek (1986)]. A similar assumption is also 
used in Spiegelhalter, Freedman and Parmar (1994) and the development of 
the frequentist approach in Section 3. Further discussion on heterogeneity 
among experts can be found in Section 5. 

The goal of our project is to incorporate the information in Figure 1, 
solicited from experts, with the data from the pilot clinical trial, and make 
inference about the improvement of the treatment effect. Findings from the 
inference are intended for generating hypotheses to be tested in future stud- 
ies. 



2. A full Bayesian solution: Methodology, theory and algorithm. 

2.1. Summarize external/prior information using an informative prior 
distribution. Beta distributions are often conventional choices for modeling 
the prior of the Bernoulli parameter $p_0$ or $p_1$. They are sufficiently flexible
for capturing distributions of different shapes. In particular, we consider four
forms of joint beta distributions for the prior of $(p_0,p_1)$.

• Independent Beta prior. Joseph, du Berger and Belisle (1997) used
independent beta priors to summarize "pre-experimental information" about
"two independent binomial parameters" $p_0$ and $p_1$ as follows:
$\pi(p_0,p_1) = \pi(p_0)\pi(p_1)$ with $\pi(p_0) \sim \text{BETA}(q_0,r_0)$ and
$\pi(p_1) \sim \text{BETA}(q_1,r_1)$. Here, $(q_0,r_0,q_1,r_1)$ are unknown prior
parameters (hyperparameters) which can be estimated using the method of
moments, following Lee (1992) and Joseph, du Berger and Belisle (1997).
Specifically, for our clinical study, the average treatment effect and its
standard deviation, $\mu_d$ and $\sigma_d$, can be obtained (estimated) from
the mean and standard deviation of the histogram of Figure 1. Based on
previous clinical trials [cf. Silberstein et al. (2004) and Brandes et al.
(2004)], the average effectiveness $\mu_0$ of Treatment B and its standard
deviation $\sigma_0$ can also be obtained. We can estimate the prior
parameters $(q_0,r_0,q_1,r_1)$ by solving the equations
$\mu_0 = E(p_0) = q_0/(q_0 + r_0)$,
$\mu_0 + \mu_d = E(p_1) = q_1/(q_1 + r_1)$,
$\sigma_0^2 = \mathrm{var}(p_0) = q_0 r_0/\{(q_0 + r_0)^2(q_0 + r_0 + 1)\}$ and
$\sigma_d^2 - \sigma_0^2 = \mathrm{var}(p_1) = q_1 r_1/\{(q_1 + r_1)^2(q_1 + r_1 + 1)\}$.

• Independent hierarchical Beta prior. Gelman et al. [(2004), Chapter 5]
suggested that hierarchical priors are more flexible and can avoid
"problems of over-fitting" in Bayesian models. We modify their approach to
reflect the informative prior in our problem setting with two sets of
independent Bernoulli experiments. Specifically, we still model the prior
of $(p_0,p_1)$ independently with $\pi(p_0,p_1) = \pi(p_0)\pi(p_1)$, but each
$\pi(p_i)$, for $i = 0, 1$, assumes two levels of hierarchies as follows:

$p_i \,|\, q_i, r_i \sim \text{BETA}(q_i, r_i)$,

$\xi_i = q_i/(q_i + r_i) \sim \text{BETA}(\alpha_i, \beta_i)$ and $\eta_i = q_i + r_i \sim \text{GAMMA}(\alpha_i + \beta_i)$.

Here, $\xi_i$ is the mean and $\eta_i$ is the "sample size" of
$\text{BETA}(q_i, r_i)$, following Gelman et al. (2004), and $\text{GAMMA}(t)$
refers to the standard gamma distribution whose shape parameter is $t$ and
scale parameter is 1. Again, we use the method of moments to estimate the
unknown parameters in this (hyper)prior distribution. In this case, the
first two marginal moments of $p_i$ are
$E(p_i) = \alpha_i/(\alpha_i + \beta_i)$ and
$\mathrm{var}(p_i) = \alpha_i\beta_i/\{(\alpha_i + \beta_i)(\alpha_i + \beta_i + 1)\}\,[(\alpha_i + \beta_i)^{-1} + \int_0^\infty \{(x + 1)^{-1} x^{\alpha_i + \beta_i - 1} e^{-x}/\Gamma(\alpha_i + \beta_i)\}\,dx]$,
for $i = 0, 1$. The prior parameters $(\alpha_0,\beta_0,\alpha_1,\beta_1)$ are
obtained by solving the equations $\mu_0 = E(p_0)$,
$\mu_0 + \mu_d = E(p_1)$, $\sigma_0^2 = \mathrm{var}(p_0)$ and
$\sigma_d^2 - \sigma_0^2 = \mathrm{var}(p_1)$.
• Dependent bivariate Beta (BIBETA) prior. Although most of the analysis
in Joseph, du Berger and Belisle (1997) was based on the independent beta
prior, a dependent beta prior, with $\pi(p_0)$ and
$\pi(p_1|p_0) = \pi(p_0,p_1)/\pi(p_0)$ both being beta distributions, was
considered in a numerical example there. Joseph, du Berger and Belisle
(1997) commented that "it is often desirable to allow dependence between
$p_0$ and $p_1$." This point is particularly relevant to our case study,
since almotriptan is used in both groups. However, the constraint
$E(p_1) = E(p_0)$, required in the formulation in Joseph, du Berger and
Belisle (1997), does not fit our case. Instead, we use a more flexible
bivariate beta distribution (BIBETA), introduced in Olkin and Liu (2003),
to model $(p_0,p_1)$ in our prior function. This BIBETA distribution ensures
that the marginal prior distributions of $p_0$ and $p_1$ are both beta
distributions. More importantly, it also allows modeling the correlation
between $p_0$ and $p_1$ in the range [0,1]. This BIBETA distribution, with
parameters $q_0$, $q_1$ and $r$, has a nice latent structure, that is,
$p_0 = U/(U + W)$ and $p_1 = V/(V + W)$, where $U$, $V$ and $W$ are standard
gamma random variables with respective shape parameters $q_0$, $q_1$ and $r$.
It follows that the joint density (prior distribution) of $p_0$ and $p_1$ is

$\pi(p_0,p_1) \propto \dfrac{p_0^{q_0-1}\, p_1^{q_1-1}\, (1 - p_0)^{q_1+r-1}\, (1 - p_1)^{q_0+r-1}}{(1 - p_0 p_1)^{q_0+q_1+r}}$.

We obtain the prior parameter values $(q_0,q_1,r)$ by the method of
moments, solving equations $\mu_0 = E(p_0) = q_0/(q_0 + r)$,
$\mu_d = E(p_1 - p_0) = q_1/(q_1 + r) - q_0/(q_0 + r)$ and
$\sigma_d^2 + \mu_d^2 = E(p_1 - p_0)^2 = \{q_0(q_0 + 1)\}/\{(q_0 + r)(q_0 + r + 1)\} + \{q_1(q_1 + 1)\}/\{(q_1 + r)(q_1 + r + 1)\} - 2\,{}_3F_2(q_0 + 1, q_1 + 1, q_0 + q_1 + r;\ q_0 + q_1 + r + 1, q_0 + q_1 + r + 1;\ 1)$.
Here ${}_3F_2(\cdot)$ denotes a hypergeometric function, which can be
calculated using the software Mathematica, as mentioned in Olkin and Liu
(2003).
• Dependent hierarchical BIBETA prior. We also consider a hierarchical
BIBETA prior in which we assign hyperprior distributions on the parameters
of the BIBETA distribution:

$(p_0, p_1) \,|\, q_0, q_1, r \sim \text{BIBETA}(q_0, q_1, r)$,

$q_0 \sim \text{GAMMA}(\alpha_0)$, $q_1 \sim \text{GAMMA}(\alpha_1)$ and $r \sim \text{GAMMA}(\beta)$.

The second level of the hyperprior model implies that
$\xi_i = q_i/(q_i + r) \sim \text{BETA}(\alpha_i, \beta)$ and
$\eta_i = q_i + r \sim \text{GAMMA}(\alpha_i + \beta)$, for $i = 0, 1$, which
matches the conventional parameterization of hierarchical beta priors.
This hierarchical BIBETA prior is more flexible than the regular BIBETA
distribution. To obtain the prior parameter values
$(\alpha_0,\alpha_1,\beta)$, we solve equations $\mu_0 = E(p_0)$,
$\mu_d = E(p_1 - p_0)$ and $\sigma_d^2 = \mathrm{var}(p_1 - p_0)$. Here, the
marginal means of $p_0$ and $p_1$ are simply
$E(p_0) = \alpha_0/(\alpha_0 + \beta)$ and $E(p_1) = \alpha_1/(\alpha_1 + \beta)$.
The marginal variance
$\mathrm{var}(p_1 - p_0) = E\{\mathrm{var}(p_1 - p_0|q_0,q_1,r)\} + \mathrm{var}\{E(p_1 - p_0|q_0,q_1,r)\}$
involves three integrations which can be obtained by numerical integration.
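
Two of the prior constructions above lend themselves to a short numerical sketch. The first function solves the method-of-moments equations for an independent beta prior in closed form; the second draws from the BIBETA prior through its latent gamma structure. The function names and all numerical inputs are hypothetical, and the sketch assumes that $U$, $V$ and $W$ are mutually independent, as in the construction of Olkin and Liu (2003).

```python
import random


def beta_from_moments(mean, var):
    """Solve mean = q/(q+r) and var = q*r/{(q+r)^2 (q+r+1)} for (q, r)."""
    if not 0 < mean < 1 or not 0 < var < mean * (1 - mean):
        raise ValueError("moments are not attainable by a beta distribution")
    total = mean * (1 - mean) / var - 1  # this is q + r
    return mean * total, (1 - mean) * total


def sample_bibeta(q0, q1, r, size, seed=0):
    """Draw (p0, p1) pairs via p0 = U/(U+W), p1 = V/(V+W).

    U, V, W are standard gamma variables with shapes q0, q1, r, drawn
    independently here (an assumption of this sketch).
    """
    rng = random.Random(seed)
    draws = []
    for _ in range(size):
        u = rng.gammavariate(q0, 1.0)
        v = rng.gammavariate(q1, 1.0)
        w = rng.gammavariate(r, 1.0)
        draws.append((u / (u + w), v / (v + w)))
    return draws


# Hypothetical prior summaries (not the values used in the paper).
mu_0, sigma_0 = 0.40, 0.05   # control success rate and its sd
mu_d, sigma_d = 0.05, 0.08   # treatment improvement and its sd

q0, r0 = beta_from_moments(mu_0, sigma_0 ** 2)
q1, r1 = beta_from_moments(mu_0 + mu_d, sigma_d ** 2 - sigma_0 ** 2)
print("independent beta prior:", (q0, r0), (q1, r1))

# BIBETA draws with hypothetical shape parameters; marginal means should be
# close to q0/(q0+r) and q1/(q1+r).
draws = sample_bibeta(q0=2.0, q1=3.0, r=4.0, size=20000)
mean_p0 = sum(p0 for p0, _ in draws) / len(draws)
mean_p1 = sum(p1 for _, p1 in draws) / len(draws)
print("BIBETA sample means:", mean_p0, mean_p1)
```

The shared variable $W$ is what induces dependence between $p_0$ and $p_1$; the hierarchical variants would add one more layer that draws the shape parameters themselves.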

2.2. Summarize trial data of binary outcomes as a likelihood function.
For the clinical trial with binary outcomes, the likelihood function of
$(p_0,p_1)$ is

(2.1) $\ell(p_0,p_1|n_0,\bar X_0,n_1,\bar X_1) \propto p_0^{n_0\bar X_0}\, p_1^{n_1\bar X_1}\, (1 - p_0)^{n_0(1-\bar X_0)}\, (1 - p_1)^{n_1(1-\bar X_1)}$.

2.3. Combine prior information and trial data as a posterior distribution.
Following the Bayes formula, each of the four prior distributions can be
incorporated with the likelihood function (2.1) to produce a joint posterior
distribution of $(p_0,p_1)$,

(2.2) $\pi(p_0,p_1|n_0,\bar X_0,n_1,\bar X_1) \propto \pi(p_0,p_1)\,\ell(p_0,p_1|n_0,\bar X_0,n_1,\bar X_1)$.

The marginal posterior distribution for the parameter of interest
$\delta\,(= p_1 - p_0)$ is then

(2.3) $\pi(\delta|n_0,\bar X_0,n_1,\bar X_1) = \int_{\max(-\delta,0)}^{\min(-\delta+1,1)} \pi(p_0, p_0 + \delta|n_0,\bar X_0,n_1,\bar X_1)\,dp_0$,

from which exact Bayesian inferences for $\delta$ can be drawn.

In the case where the prior is modeled by two independent beta
distributions, the posterior distribution (2.2) is simply a product of two
independent beta distributions,
$\text{BETA}(n_0\bar X_0 + q_0,\ n_0(1 - \bar X_0) + r_0)$ and
$\text{BETA}(n_1\bar X_1 + q_1,\ n_1(1 - \bar X_1) + r_1)$. However, in the
other three cases neither (2.2) nor (2.3) is of any known form of
distribution, and both are thus difficult to manipulate. To this end, we
propose a Metropolis-Hastings algorithm to simulate random samples from the
posterior distributions (2.2) and (2.3). See Appendix I [Xie et al. (2013)]
for the proposed Metropolis-Hastings algorithm.
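
For the conjugate case only, the marginal posterior of $\delta$ can be approximated without any MCMC by drawing from the two beta posteriors and differencing. The sketch below does this; it is a plain Monte Carlo substitute for the integral (2.3), not the Metropolis-Hastings algorithm of Appendix I. The success counts and hyperparameters are hypothetical, and the function names are introduced here for illustration.

```python
import random


def posterior_delta_draws(x0, n0, x1, n1, q0, r0, q1, r1, size=20000, seed=0):
    """Draw delta = p1 - p0 from the product-of-betas posterior.

    x0, x1 are success counts, so x_j = n_j * Xbar_j in the text's notation.
    """
    rng = random.Random(seed)
    a0, b0 = x0 + q0, (n0 - x0) + r0  # BETA(n0*Xbar0 + q0, n0*(1 - Xbar0) + r0)
    a1, b1 = x1 + q1, (n1 - x1) + r1  # BETA(n1*Xbar1 + q1, n1*(1 - Xbar1) + r1)
    return [rng.betavariate(a1, b1) - rng.betavariate(a0, b0) for _ in range(size)]


def summarize(draws, level=0.95):
    """Return the sample median and an equal-tailed credible interval."""
    ordered = sorted(draws)
    n = len(ordered)
    lower = ordered[int(n * (1 - level) / 2)]
    upper = ordered[int(n * (1 + level) / 2) - 1]
    return ordered[n // 2], (lower, upper)


# Hypothetical data and hyperparameters (assumptions, not the trial's values).
draws = posterior_delta_draws(
    x0=28, n0=68, x1=30, n1=59, q0=38.0, r0=57.0, q1=28.0, r1=34.0
)
post_median, (post_lo, post_hi) = summarize(draws)
print(f"median={post_median:.4f}, 95% interval=({post_lo:.4f}, {post_hi:.4f})")
```

The same summary function could be applied to MCMC output for the three non-conjugate priors, once draws of $(p_0,p_1)$ are available.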

The resulting marginal posterior density of $\delta$ in (2.3) incorporates
the prior evidence of expert opinions on the treatment improvement $\delta$
with the evidence from clinical data. The full Bayesian approaches are
theoretically sound and should provide a systematic solution to our problem
in the Bayesian paradigm. However, as observed in Section 4.2, in the case
of a skewed prior, the approaches may lead to the discrepant posterior
phenomenon in that the posterior distributions of $\delta$ can yield an
estimate that is not between the two estimates derived from the
corresponding prior distribution and the likelihood evidence! Further
examination suggests that this phenomenon is quite general. See Section 4.2
for details.

3. A frequentist Bayes compromise: A CD approach. In this section we 
use the so-called confidence distribution (CD) to develop a new approach for 
incorporating expert opinions with the trial data of binary outcomes. This 
approach follows frequentist principles and treats parameters as fixed non- 
random values. It provides an attractive alternative to Bayesian methods. 
In Section 3.1 we provide a definition and a brief review of the CD concept. 
In Sections 3.2-3.4 we develop the proposed CD approach. This approach 
can be simply outlined as follows: use a CD to summarize the prior in- 
formation or expert opinions (Section 3.2), use another CD (often from a 
likelihood function) to summarize the observed data from the clinical trial 
(Section 3.3), and then combine these two CDs into one CD (Section 3.4). 
This combined CD can be used to derive various inferences. Its role in fre- 
quentist inference is similar to that of a posterior distribution in Bayesian 
inference. This development provides yet another example in which a CD 
can provide useful statistical inference tools for problems where frequen- 
tist methods with desirable properties were previously unavailable. Bickel 
(2006) gives a similar development for normal clinical trials using an objec- 
tive Bayes argument. The review article by Xie and Singh (2012) contains 
further discussion. 
3.1. A brief review of confidence distribution (CD). The CD concept is
a simple one. For practical purposes, a CD is simply a distribution esti- 
mator for a parameter of interest. More specifically, instead of the usual 
point estimators or interval estimators (i.e., confidence intervals), CD uses 
a distribution function to estimate the parameter. The development of the 
CD has a long history; see, for example, Fisher (1930), Neyman (1941) and 
Lehmann (1993). But its associated inference schemes and applications have 
not received much attention until recently; see, for example, Efron (1998), 
Schweder and Hjort (2002, 2003, 2012), Singh, Xie and Strawderman (2001, 
2005, 2007), Lawless and Fredette (2005), Tian et al. (2011), Xie, Singh and 
Strawderman (2011) and Singh and Xie (2012). Although the CD approach 
is closely related to Fisher's fiducial approach, as seen in the classical liter- 
ature, the new CD developments are purely frequentist tools involving no 
fiducial reasoning. Further discussion of this point as well as the relations 
between CD-based inference and fiducial and Bayesian inferences can be 
found in the comprehensive review by Xie and Singh (2012). 

The following CD definition is formulated in Schweder and Hjort (2002) 
and Singh, Xie and Strawderman (2005) under the framework of frequentist 
inference. Singh, Xie and Strawderman (2005) demonstrated that this new 
version is consistent with the classical CD definition, and it is easier to use in 
practice. In the definition, $\theta$ (fixed/nonrandom) is the unknown parameter
of interest, $\Theta$ is its parameter space, $\mathbf{X}_n = (X_1,\ldots,X_n)^T$ is the sample data
set, and $\mathcal{X}$ is the corresponding sample space.

Definition A. A function $H_n(\cdot) = H_n(\mathbf{X}_n,\cdot)$ on
$\mathcal{X} \times \Theta \to [0,1]$ is a confidence distribution (CD) for a
parameter $\theta$, if it meets the following two requirements: (R1) For each
given $\mathbf{X}_n \in \mathcal{X}$, $H_n(\cdot)$ is a continuous cumulative
distribution function; (R2) At the true parameter value
$\theta = \theta_0$, $H_n(\theta_0) = H_n(\mathbf{X}_n,\theta_0)$, as a
function of the sample $\mathbf{X}_n$, follows the uniform distribution
$U[0,1]$.

The function $H_n(\cdot)$ is an asymptotic confidence distribution (aCD), if
the $U[0,1]$ requirement holds only asymptotically, and the continuity
requirement on $H_n(\cdot)$ is dropped. Also, when it exists,
$h_n(\theta) = H_n'(\theta)$ is called a CD density or confidence density.

The CD is a function of both the parameter and the random sample. It 
is also a sample-dependent distribution function on the parameter space, 
following requirement R1. Conceptually, it estimates the parameter by a
distribution function. As an estimation instrument, it is not much differ- 
ent from a point estimator or a confidence interval. For example, for point 
estimation, any single point (a real value or a statistic) can, in principle, 
be an estimate for a parameter, and we often impose additional restrictions 
to ensure that the point estimator has certain desired properties, such as 
unbiasedness, consistency, etc. The two requirements in Definition A play
roles similar to those restrictions. Specifically, R1 suggests that a
sample-dependent distribution function on the parameter space can potentially be
used as an estimate for the parameter. The U[0, 1] requirement in R2 ensures 
that the statistical inferences (e.g., point estimates, confidence intervals, p- 
values) derived from the CD have desired frequentist properties. 

Like a posterior distribution function, a CD contains a wealth of infor- 
mation for inference. It is a useful device for constructing all types of fre- 
quentist statistical inferences, including point estimates, confidence intervals 
and p-values. For instance, it is evident from requirement R2 that intervals
obtained from a confidence distribution such as $(H_n^{-1}(\alpha_1), H_n^{-1}(1 - \alpha_2))$ can
always maintain the nominal level of $100(1 - \alpha_1 - \alpha_2)\%$ for coverage of $\theta$.
See Section 4 of Xie and Singh (2012) and references therein for more de- 
tails. Also, the CD concept is rather general. In fact, recent research has 
shown that Definition A encompasses a wide range of existing examples, in- 
cluding most examples in the classical development of Fisher's fiducial dis- 
tributions, bootstrap distributions, significance functions [p-value functions, 
Fraser (1991)], standardized likelihood functions, and certain Bayesian prior 
and posterior distributions; see, for example, Schweder and Hjort (2002), 
Singh, Xie and Strawderman (2005, 2007) and Xie and Singh (2012). 
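
Requirement R2 and the interval property can be checked numerically in a simple textbook case. The sketch below assumes a normal sample with known standard deviation, for which $H_n(\theta) = \Phi(\sqrt{n}(\theta - \bar x)/\sigma)$ is a CD for the mean; this example, the function names and the simulation settings are assumptions made here for illustration and are not taken from the paper's Appendix II.

```python
import random
from statistics import NormalDist


def normal_mean_cd(xbar, sigma, n):
    """CD for the mean of N(theta, sigma^2) with sigma known.

    H_n(theta) is the cdf of N(xbar, sigma^2/n) evaluated at theta.
    Returns H_n and its inverse.
    """
    cd = NormalDist(mu=xbar, sigma=sigma / n ** 0.5)
    return cd.cdf, cd.inv_cdf


def simulate(theta0=0.3, sigma=1.0, n=25, reps=4000, a1=0.025, a2=0.025, seed=1):
    """Repeatedly sample data, record H_n(theta0) and interval coverage."""
    rng = random.Random(seed)
    h_at_truth, covered = [], 0
    for _ in range(reps):
        xbar = sum(rng.gauss(theta0, sigma) for _ in range(n)) / n
        cd, cd_inv = normal_mean_cd(xbar, sigma, n)
        h_at_truth.append(cd(theta0))  # R2: should look U[0,1]
        covered += cd_inv(a1) <= theta0 <= cd_inv(1 - a2)
    return h_at_truth, covered / reps


h_values, coverage = simulate()
mean_h = sum(h_values) / len(h_values)
frac_below_quarter = sum(h < 0.25 for h in h_values) / len(h_values)
print(f"mean H_n(theta0)={mean_h:.3f}, P(H<0.25)={frac_below_quarter:.3f}, "
      f"coverage={coverage:.3f}")
```

If R2 holds, the recorded values of $H_n(\theta_0)$ should behave like uniform draws and the interval should cover $\theta_0$ at roughly the nominal rate, here 95%.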

Two examples of CDs which are relevant to the exposition of this paper 
are provided in Appendix II [Xie et al. (2013)]. Further, Singh, Xie and 
Strawderman (2005) and Xie, Singh and Strawderman (2011) developed 
a general method for combining CDs from independent studies, which is 
utilized in Section 3.4. 

3.2. Summarize external/prior information using a CD. A key task in 
our CD approach in this paper is to construct a CD which summarizes the 
treatment improvement $\delta$, using only the information obtained prior to the
clinical trial. In the following few paragraphs we use a set of modeling argu- 
ments to justify that the distribution underlying the histogram in Figure 1 
is a CD for the prior information. Some of these arguments are similar to 
those used in Genest and Zidek (1986) for Bayesian approaches, and our 
concluded prior CD matches in form the prior distribution suggested by 
Spiegelhalter, Freedman and Parmar (1994). This match of our prior CD 
with the commonly used Bayesian prior allows a comparison of the CD ap- 
proach and the Bayesian approach on an equal footing. Note that what we 
show here is only one of many possible modeling approaches to achieve our 
purpose. We will not dwell on this topic since the main goal of the paper is 
to study and compare inference approaches of incorporating expert opinions 
with clinical trial data. 

Example A.2 in Appendix II [Xie et al. (2013)] shows that an informative
prior could be viewed as a CD, provided that a sample space of the prior
knowledge or past experiments can be defined. In the same spirit, we assume
that the expert opinions are based on past knowledge or experiments about
the improvement $\delta$ (the knowledge could be from experience or from similar,
or informal, or even virtual experiments, but no actual data are available).
This assumption ensures an informative prior and allows us to have a prior
CD for the improvement $\delta$. In particular, let $Y_0$ be a statistic (with the
sample realization $y_0$) that summarizes the information on $\delta$ gathered from
past experience or experiments. Let $\hat\delta(Y_0)$ be an estimator of $\delta$ and also let
$F(t) = P\{\hat\delta(Y_0) - \delta \le t\}$ be the cumulative distribution function of
$\hat\delta(Y_0) - \delta$. We assume for simplicity that $F(t)$ does not involve unknown nuisance
parameters or, if it does, that they are replaced by their respective consistent
estimates (in this case the development here holds only asymptotically). The
prior knowledge then gives rise to the following CD (or asymptotic CD) for $\delta$:

(3.1) $H_0(\delta) = 1 - F(\hat\delta(y_0) - \delta),$

since the two requirements in Definition A hold for $H_0(\delta)$. For illustration,
consider the case in which $\{\hat\delta(Y_0) - \delta\}/s_0 \to N(0, 1)$, where $s_0^2$ is an estimate
of $\mathrm{var}(\hat\delta(Y_0))$. In this case, (3.1) is $H_0(\cdot) = \Phi(\{\cdot - \hat\delta(y_0)\}/s_0)$. Equivalently,
$N(\hat\delta(y_0), s_0^2)$ is a distribution estimate of $\delta$.
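
As a minimal numerical sketch of the normal case of (3.1), the Python snippet below builds the CD $\Phi(\{\cdot - \hat\delta(y_0)\}/s_0)$ and reads a two-sided interval off its quantiles. The values `delta_hat0 = 0.05` and `s0 = 0.06` are hypothetical placeholders chosen for illustration; they are not taken from the paper.

```python
from statistics import NormalDist

# Hypothetical prior summary (illustrative values only, not from the paper).
delta_hat0 = 0.05   # assumed realized estimate delta_hat(y_0)
s0 = 0.06           # assumed standard error s_0

prior_cd = NormalDist(mu=delta_hat0, sigma=s0)

def H0(delta):
    """Normal CD from (3.1): Phi((delta - delta_hat0) / s0)."""
    return prior_cd.cdf(delta)

def cd_interval(alpha1, alpha2):
    """Interval (H0^{-1}(alpha1), H0^{-1}(1 - alpha2)); level 1 - alpha1 - alpha2."""
    return prior_cd.inv_cdf(alpha1), prior_cd.inv_cdf(1.0 - alpha2)

lower, upper = cd_interval(0.05, 0.05)   # a 90% interval
print(round(lower, 3), round(upper, 3))
```

The interval endpoints are simply CD quantiles, so unequal tail areas (`alpha1 != alpha2`) need no extra machinery.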

In practice, the realization $y_0$ of the prior trials is unobserved or only
vaguely perceived. We rely on a survey of expert opinions to recover this prior
information and $H_0(\delta)$, as in our case study in Section 1.1. For simplicity,
we assume that the $I$ experts in the survey are randomly selected from a
large pool of experts on the subject matter. We also assume that the experts
are randomly exposed to some pre-existing experiments or knowledge, which
in fact resembles a bootstrapping procedure. Denote by $Y_i^*$ the summary
statistic of the pre-existing knowledge on $\delta$ which the $i$th expert is exposed
to and upon which his/her opinion is based. It follows that $Y_i^*$ is a bootstrap
copy of $Y_0$. Following Example 2.4 of Singh, Xie and Strawderman (2005),
a CD for $\delta$ from the bootstrap sample $Y_i^*$ is

(3.2) $H_{0,i}^{B}(\delta) = 1 - P\{\hat\delta(Y_i^*) - \hat\delta(y_0) \le \hat\delta(y_0) - \delta \mid Y_0 = y_0\}.$

This $H_{0,i}^{B}(\cdot)$ is usually the same as $H_0(\cdot)$ with probability 1 under some mild
conditions, such as those required for standard bootstrap theory.

However, the function $H_{0,i}^{B}(\delta)$ only summarizes the prior knowledge which
the $i$th expert is exposed to. We need to associate it with his/her "reported"
opinion in the survey table such as in Table 1. Let us define, from the $i$th
row of Table 1, an (empirical) cumulative distribution function

$H_{0,i}^{R}(\delta) = \sum_{k=1}^{12} g_{i,k}\, 1_{(\delta \ge L_k)},$

where $100 g_{i,k}$ is the $k$th number reported in the $i$th row of Table 1, $L_k$ is the
lower bound of the $k$th interval and $1_{(\cdot)}$ is the indicator function. This $H_{0,i}^{R}(\delta)$
is the "reported" distribution for the improvement $\delta$ by the $i$th expert.
In the ideal case, if the "reported" expert opinion faithfully recorded
the "true" expert opinion and the "true" expert opinion truly reflects the
"true" prior knowledge, $H_{0,i}^{B}(\delta)$ and $H_{0,i}^{R}(\delta)$ would be the same. But there
are often variations in reality. A detailed discussion on how to model such
variations is provided in Section 5 as a concluding remark. We proceed
with the popular "arithmetic pooling" approach, which is also articulated
in Spiegelhalter, Freedman and Parmar (1994). An underlying assumption
of arithmetic pooling is that the average of the "observed" expert opinions is
an unbiased representation of the "true" prior knowledge. In our case, this
is equivalent to assuming the additive error model $H_{0,i}^{R}(\delta) = H_{0,i}^{B}(\delta) + e_i$
such that $I^{-1}\sum_{i=1}^{I} e_i \approx 0$ uniformly in $\delta$, where $e_i = e_i(\delta)$ is defined as the
difference between $H_{0,i}^{R}(\delta)$ and $H_{0,i}^{B}(\delta)$ and is viewed as a random error for
the discrepancies among the "true" prior knowledge, the "true" expert
opinion and the "reported" opinion of the $i$th expert. Under this error model,
where the heterogeneous deviations among experts are "averaged out," it
follows that

(3.3) $H_0(\delta) \approx I^{-1}\sum_{i=1}^{I} H_{0,i}^{R}(\delta) = \sum_{k=1}^{12} \bar g_k\, 1_{(\delta \ge L_k)},$

where $100\bar g_k = 100\, I^{-1}\sum_{i=1}^{I} g_{i,k}$ are the group means reported in the last
row of Table 1. From (3.3), the (standardized) histogram in Figure 1, that
is, $f_{\mathrm{hist}}(\delta) = \sum_{k=1}^{12} \{\bar g_k/(L_{k+1} - L_k)\}\, 1_{(L_{k+1} > \delta \ge L_k)}$, is clearly a suitable
approximation for the underlying confidence density function $h_0(\delta) = H_0'(\delta)$.
Here, the word "standardized" refers to scaling the histogram so that its
area is 1. In our calculations in Section 4, we have used $L_{13} = 0.24$ as the
upper bound of the 12th interval.
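
The arithmetic pooling in (3.3) can be sketched in a few lines of Python. The survey below is hypothetical (three experts and four intervals rather than the 12 intervals of Table 1), and the interval bounds in `edges` are assumptions made only so that the example runs.

```python
# Hypothetical survey: 3 experts, 4 intervals (Table 1 in the paper has 12).
# Each row lists the percentages one expert assigns to the intervals.
edges = [-0.05, 0.00, 0.05, 0.10, 0.15]   # assumed bounds L_1, ..., L_5
reported = [
    [10, 40, 40, 10],
    [0, 30, 50, 20],
    [20, 50, 30, 0],
]

n_experts = len(reported)
n_bins = len(edges) - 1

# Group means g_bar_k = I^{-1} * sum_i g_{i,k}, expressed as proportions.
g_bar = [sum(row[k] for row in reported) / (100.0 * n_experts)
         for k in range(n_bins)]

def pooled_cd(delta):
    """Step-function prior CD from (3.3): sum_k g_bar_k * 1(delta >= L_k)."""
    return sum(g for g, low in zip(g_bar, edges) if delta >= low)

def f_hist(delta):
    """Standardized histogram density: g_bar_k / (L_{k+1} - L_k) on bin k."""
    for k in range(n_bins):
        if edges[k] <= delta < edges[k + 1]:
            return g_bar[k] / (edges[k + 1] - edges[k])
    return 0.0

print(g_bar, pooled_cd(0.02), f_hist(0.02))
```

Because the pooled weights sum to one, `f_hist` integrates to one over the intervals, which is what "standardized" means in the text.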

3.3. Summarize clinical trial data as a CD. The task of summarizing the
clinical trial results into a CD is relatively easier. The maximum likelihood
estimator of $\delta$ is $\hat\delta = \bar X_1 - \bar X_0$ with variance
$C_d^2 = \mathrm{var}(\hat\delta) = p_1(1 - p_1)/n_1 + p_0(1 - p_0)/n_0$.
An estimator of $C_d^2$ is $\hat C_d^2 = \bar X_1(1 - \bar X_1)/n_1 + \bar X_0(1 - \bar X_0)/n_0$.
If both $n_i$'s tend to $\infty$, we have $(\hat\delta - \delta)/\hat C_d \to N(0, 1)$. Therefore, an
asymptotic CD for the parameter $\delta$ from the clinical trial is

(3.4) $H_T(\delta) = \Phi((\delta - \hat\delta)/\hat C_d).$

In other words, the distribution $N(\hat\delta, \hat C_d^2)$ can be used to estimate $\delta$. An
alternative approach is to use the profile likelihood function of $\delta$. Specifically,
let $\ell_{\mathrm{prof}}(\delta)$ be the log profile likelihood function of $\delta$, and let
$\ell_n^*(\delta) = \ell_{\mathrm{prof}}(\delta) - \ell_{\mathrm{prof}}(\hat\delta)$.
Following Singh, Xie and Strawderman (2007), we can show that

(3.5) $H_T(\delta) = \int_{-1}^{\delta} h_T(\theta)\,d\theta$ with $h_T(\theta) = \dfrac{e^{\ell_n^*(\theta)}}{\int_{-1}^{1} e^{\ell_n^*(\theta)}\,d\theta} \propto e^{\ell_{\mathrm{prof}}(\theta)}$

is an asymptotic CD for $\delta$. The two $H_T(\cdot)$ above are asymptotically equivalent
when the $n_i$'s tend to $\infty$.
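
To make (3.4) and (3.5) concrete, the following Python sketch computes both CDs for the PR2 counts quoted in Section 4.1 (31 of 68 in the control group, 33 of 59 in the treatment group). The grid resolution, the ternary search used for the profiling step and all function names are choices made for this illustration, not the authors' implementation.

```python
import math
from statistics import NormalDist

# PR2 counts quoted in Section 4.1: responders / group size.
x0, n0 = 31, 68    # control
x1, n1 = 33, 59    # treatment

p0_hat, p1_hat = x0 / n0, x1 / n1
delta_hat = p1_hat - p0_hat
c_hat = math.sqrt(p1_hat * (1 - p1_hat) / n1 + p0_hat * (1 - p0_hat) / n0)

def wald_cd(delta):
    """Normal-approximation CD (3.4)."""
    return NormalDist().cdf((delta - delta_hat) / c_hat)

def profile_loglik(delta, iters=60):
    """Log profile likelihood: maximize over p0 with p1 = p0 + delta."""
    lo = max(0.0, -delta) + 1e-9
    hi = min(1.0, 1.0 - delta) - 1e-9

    def loglik(p0):
        p1 = p0 + delta
        return (x0 * math.log(p0) + (n0 - x0) * math.log(1.0 - p0)
                + x1 * math.log(p1) + (n1 - x1) * math.log(1.0 - p1))

    for _ in range(iters):          # ternary search; loglik is concave in p0
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if loglik(m1) < loglik(m2):
            lo = m1
        else:
            hi = m2
    return loglik(0.5 * (lo + hi))

def profile_cd(n_points=999):
    """Grid version of (3.5): CD whose density is proportional to exp(l_prof)."""
    grid = [-0.999 + 1.998 * k / (n_points - 1) for k in range(n_points)]
    ll = [profile_loglik(d) for d in grid]
    top = max(ll)
    weights = [math.exp(v - top) for v in ll]
    total = sum(weights)
    cdf, acc = [], 0.0
    for w in weights:
        acc += w / total
        cdf.append(acc)
    return grid, cdf

grid, cdf = profile_cd()
median = next(d for d, c in zip(grid, cdf) if c >= 0.5)
print(round(delta_hat, 3), round(c_hat, 3), round(median, 3))
```

With samples of this size the two CDs are centered at nearly the same value, in line with the asymptotic equivalence stated above.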

3.4. Combine prior information and trial data as a combined CD. We
can incorporate the prior CD $H_0(\delta)$ with the CD $H_T(\delta)$ from the clinical
trial using a general CD combination method developed by Singh, Xie and
Strawderman (2005). Xie, Singh and Strawderman (2011) showed that this
general method and its extension can provide a unifying framework for most
information combination methods used in current practice, including both
the classical approach of combining p-values and the modern model-based
(fixed and random effects models) meta-analysis approach. In our context,
we are combining two CDs. Specifically, we let $G_c(t) = P(g_c(U_1, U_2) \le t)$,
where $U_1$ and $U_2$ are independent $U[0, 1]$ random variables, and $g_c(u_1, u_2)$
is a continuous function from $[0, 1] \times [0, 1]$ to $\mathbb{R}$ which is monotonic (say,
increasing) in each coordinate. Then,

(3.6) $H^{(c)}(\delta) = G_c\{g_c(H_0(\delta), H_T(\delta))\}$

is a combined CD for $\delta$ which contains information from both expert opinions
and the clinical trial. One simple choice is

(3.7) $g_c(u_1, u_2) = w_1\Phi^{-1}(u_1) + w_2\Phi^{-1}(u_2),$

with weights $w_1 = 1/\hat\sigma_d$ and $w_2 = 1/\hat C_d$, where $\hat\sigma_d$ is an estimate of the
standard deviation of $H_0(\delta)$ (specifically, $\hat\sigma_d$ is the standard deviation of
the histogram in Figure 1 in our application, and it is also an estimate
of the standard deviation of $\delta$ in the normal case). In this case, $G_c(t)$
can be expressed as $\Phi(t/\{(1/\hat\sigma_d^2 + 1/\hat C_d^2)^{1/2}\})$, and thus gives rise to the
following combined CD for $\delta$:
$H^{(c)}(\delta) = \Phi(\{\Phi^{-1}(H_0(\delta))/\hat\sigma_d + \Phi^{-1}(H_T(\delta))/\hat C_d\}/\{(1/\hat\sigma_d^2 + 1/\hat C_d^2)^{1/2}\}).$
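
The normal combination in (3.6)-(3.7) is short enough to sketch directly. The Python below is an illustration under stated assumptions: the function names are ours, the inputs `mu_d`, `sigma_d`, `delta_hat` and `c_d` are hypothetical values, and the clipping constant `eps` is an implementation detail added so that $\Phi^{-1}$ stays finite when a CD equals 0 or 1 (as a histogram CD does outside its support).

```python
from statistics import NormalDist

_STD = NormalDist()

def _probit(u, eps=1e-12):
    """Phi^{-1}(u), with u clipped away from 0 and 1 to keep it finite."""
    return _STD.inv_cdf(min(max(u, eps), 1.0 - eps))

def combine_cd(h0, ht, sigma_d, c_d):
    """Return the combined CD built from (3.6)-(3.7).

    h0, ht  : callables giving the prior CD and the trial CD at delta
    sigma_d : standard deviation attached to the prior CD
    c_d     : standard error attached to the trial CD
    """
    w1, w2 = 1.0 / sigma_d, 1.0 / c_d
    scale = (w1 ** 2 + w2 ** 2) ** 0.5

    def hc(delta):
        z = w1 * _probit(h0(delta)) + w2 * _probit(ht(delta))
        return _STD.cdf(z / scale)

    return hc

# Illustrative inputs (assumed values, not taken from the paper's tables).
mu_d, sigma_d = 0.05, 0.06      # hypothetical prior mean and sd
delta_hat, c_d = 0.10, 0.09     # hypothetical trial estimate and se

h0 = NormalDist(mu_d, sigma_d).cdf
ht = NormalDist(delta_hat, c_d).cdf
hc = combine_cd(h0, ht, sigma_d, c_d)
print(round(hc(0.05), 4), round(hc(0.10), 4))
```

Passing the CDs as callables means the same function accepts a step-function histogram CD for `h0` without any curve fitting.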

When both $H_0(\delta)$ and $H_T(\delta)$ are normal (or asymptotically normal) CDs,
the normal combination in (3.7) is the most efficient in terms of preserving
Fisher information. In nonnormal cases, Singh, Xie and Strawderman (2005)
studied several choices of the function $g_c$ and their Bahadur efficiency. But
it remains an open question what choice of $g_c$ is most efficient in preserving
Fisher information in a general nonnormal setting. Although we use the
simple normal combination in (3.7) in this paper mostly for simplicity, our
experience with the numerical studies has shown this combination to be
quite adequate in most applications. In fact, in many nonnormal cases, it
incurs very little loss of efficiency in terms of preserving Fisher information
from both $H_0(\delta)$ and $H_T(\delta)$.

In a Bayesian approach, it is a conventional practice to fit a prior density
curve to the histogram in Figure 1. Although this step is not needed in
the proposed CD approach, we may sometimes also fit a density function to
$f_{\mathrm{hist}}(\delta)$. For example, we may fit a normal curve to the histogram of Figure 1
by matching its first two moments, say, mean $\hat\mu_d$ and variance $\hat\sigma_d^2$. In this
case, we have a normal CD from the expert opinions, $H_0(\delta) \approx \Phi((\delta - \hat\mu_d)/\hat\sigma_d)$.
Incorporating it with $H_T(\delta)$ in (3.4), we have the combined CD
$H^{(c)}(\delta) = \Phi((\delta - \tilde\delta)/\tilde C_d)$ or $N(\tilde\delta, \tilde C_d^2)$, where
$\tilde\delta = (\hat\delta/\hat C_d^2 + \hat\mu_d/\hat\sigma_d^2)/(1/\hat C_d^2 + 1/\hat\sigma_d^2)$ and
$\tilde C_d^2 = (1/\hat C_d^2 + 1/\hat\sigma_d^2)^{-1}$. This combined CD turns out to be the same as the
posterior distribution function obtained from the univariate Bayesian approach
described in the Introduction when the normal prior $\pi(\delta) \sim N(\hat\mu_d, \hat\sigma_d^2)$
is used. Because Bayes's formula requires that we know $f(\mathrm{data}|\delta)$ in the univariate
Bayesian approach and this conditional density function $f(\mathrm{data}|\delta)$
does not exist in our clinical setting, we have argued that the univariate
Bayesian approach is not supported by Bayesian theory. The CD development,
interestingly, provides theoretical support for using the posterior
distribution $N(\tilde\delta, \tilde C_d^2)$ from a non-Bayesian point of view if the prior distribution
of expert opinions can be approximated by a normal distribution. In
this case, the univariate Bayesian approach can also produce a result that
"makes sense," and practically we can use either the CD approach or the
univariate Bayesian approach. But this statement is not true in general.
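
The reduction to a single normal CD follows directly from (3.6) and (3.7). Spelled out, with a normal prior CD having mean $\hat\mu_d$ and standard deviation $\hat\sigma_d$, the trial CD of (3.4) with estimate $\hat\delta$ and standard error $\hat C_d$, and writing $\tilde\delta = (\hat\delta/\hat C_d^2 + \hat\mu_d/\hat\sigma_d^2)/(1/\hat C_d^2 + 1/\hat\sigma_d^2)$ and $\tilde C_d^2 = (1/\hat C_d^2 + 1/\hat\sigma_d^2)^{-1}$:

```latex
\begin{align*}
g_c\bigl(H_0(\delta), H_T(\delta)\bigr)
  &= \frac{1}{\hat\sigma_d}\cdot\frac{\delta-\hat\mu_d}{\hat\sigma_d}
   + \frac{1}{\hat C_d}\cdot\frac{\delta-\hat\delta}{\hat C_d}
   = \Bigl(\frac{1}{\hat\sigma_d^2}+\frac{1}{\hat C_d^2}\Bigr)(\delta-\tilde\delta), \\
H^{(c)}(\delta)
  &= \Phi\!\left(\frac{g_c\bigl(H_0(\delta), H_T(\delta)\bigr)}
                      {(1/\hat\sigma_d^2+1/\hat C_d^2)^{1/2}}\right)
   = \Phi\!\left(\Bigl(\frac{1}{\hat\sigma_d^2}+\frac{1}{\hat C_d^2}\Bigr)^{1/2}
                 (\delta-\tilde\delta)\right)
   = \Phi\!\left(\frac{\delta-\tilde\delta}{\tilde C_d}\right).
\end{align*}
```

The combined center is therefore a precision-weighted average of the prior mean and the trial estimate, which is why it always falls between the two in the normal case.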

4. Application: Numerical results and comparisons. We now provide nu- 
merical studies to illustrate and compare the Bayesian and CD approaches 
discussed in Sections 2 and 3. In Section 4.1 we focus on the data from 
the migraine pain study outlined in Section 1.1. In Section 4.2 we simulate 
a skewed distribution of expert opinions and combine the simulated prior 
information with the clinical trial data. 

4.1. Normal priors: A case study of the migraine pain data. For the 
outcome PR2, the clinical data report that 31 out of 68 patients in the 
control group and 33 out of 59 patients in the treatment group achieved 
pain relief at 2 hours. Our goal is to incorporate the expert inputs reported 
in Figure 1 with these observed outcomes. 

We apply the full Bayesian approach described in Section 2 to analyze
the PR2 data, using each of the four beta priors for $(p_0, p_1)$. The numerical
results are reported in Figures 2(a)-(d) and also in the first 12 rows of
Table 2. The dotted lines in Figures 2(a)-(d) indicate the marginal prior
density functions, and the dashed lines indicate the (standardized) profile
likelihood function of $\delta$ based only on the clinical trial data. The solid lines in
Figures 2(a)-(d) are the marginal posterior distributions of $\delta$. They are obtained
by using the density estimation function density(·) in the R software
and from 1000 Metropolis-Hastings samples of $\delta^* = p_1 - p_0$. In each case and
for each of the 1000 replications, the Metropolis-Hastings algorithm is iterated
t = 25,000 times (burn-in). The acceptance rates are on average 0.0379,
0.0381, 0.0480 and 0.0485, respectively, in (a) to (d). For the independent
beta prior, the exact formula of the posterior distribution is available, and it
is plotted in Figure 2(a) as the dash-dot broken curve (it is barely visible in
the plot, since it is almost identical to the solid curve). The close agreement
of these two curves for the posterior distribution indicates that the MCMC
chain of the Metropolis-Hastings algorithm has generally converged with
t = 25,000 in this case.

Fig. 2. Outcomes of data analysis from the migraine pain data.

In applying the CD approach, we use both the raw histogram in Figure 1
and the $N(\hat\mu_d, \hat\sigma_d^2)$ distribution to approximate the prior CD of expert
opinions. Figures 2(e)-(f) and the last six rows of Table 2 contain numerical
results. The dotted lines in Figures 2(e)-(f) indicate the prior CDs, the
dashed lines indicate the profile likelihood function of $\delta$ based only on the
clinical trial data, and the solid lines are for the combined CDs for $\delta$.

In this particular example, all six approaches (four Bayesian and two CD
approaches) seem to yield similar posterior or combined CD functions, and
thus similar statistical inferences, regardless of which approach is used. Although
the six marginal posterior or combined CD distributions are slightly
different from one another, the difference appears to all fall within the expected
estimation error of the density curves. This result is not surprising,
since, although skewed, the degree of skewness of the histogram in Figure 1
does not appear to be great enough to render the normal approximation
invalid. In fact, in this case, such a result is expected to hold if the central
limit theory is in place for the clinical data of binary outcomes. It is worth
noting here that the Bayesian approach implemented through an MCMC
method is more demanding computationally.

Table 2
Numerical results from incorporating expert opinions on PR2 (summarized in Figure 1)
with clinical data on PR2: mode, median, mean, I_80%, I_90% and I_95% of the marginal
prior, (normalized) profile likelihood function and marginal posterior of the parameter $\delta$.
Here, I_80%, I_90% and I_95% denote the interval [$100\alpha$%-tile, $100(1 - \alpha)$%-tile] for
$\alpha$ = 10%, 5% and 2.5%, respectively. Included in the comparison are four full Bayesian
approaches and two approaches based on confidence distributions (CDs)

Approach                      Curve       Mode   Median  Mean   I_80%            I_90%            I_95%
Bayesian approaches
Ind Beta prior                Prior       0.049  0.047   0.048  (-0.037, 0.130)  (-0.060, 0.153)  (-0.080, 0.173)
                              Likelihood  0.104  0.104   0.103  (-0.007, 0.214)  (-0.043, 0.251)  (-0.062, 0.269)
                              Posterior   0.069  0.070   0.071  (0.000, 0.138)   (-0.015, 0.156)  (-0.029, 0.174)
Hierarchical Beta prior       Prior       0.048  0.047   0.048  (-0.036, 0.130)  (-0.059, 0.154)  (-0.080, 0.175)
                              Likelihood  0.104  0.104   0.103  (-0.007, 0.214)  (-0.043, 0.251)  (-0.062, 0.269)
                              Posterior   0.082  0.071   0.070  (0.000, 0.143)   (-0.021, 0.159)  (-0.037, 0.171)
Bi-Beta prior                 Prior       0.040  0.044   0.048  (-0.029, 0.125)  (-0.050, 0.151)  (-0.069, 0.174)
                              Likelihood  0.104  0.104   0.103  (-0.007, 0.214)  (-0.043, 0.251)  (-0.062, 0.269)
                              Posterior   0.093  0.091   0.091  (0.015, 0.165)   (-0.004, 0.190)  (-0.025, 0.209)
Hierarchical Bi-Beta prior    Prior       0.043  0.045   0.048  (-0.028, 0.127)  (-0.049, 0.154)  (-0.068, 0.178)
                              Likelihood  0.104  0.104   0.103  (-0.007, 0.214)  (-0.043, 0.251)  (-0.062, 0.269)
                              Posterior   0.082  0.086   0.087  (0.013, 0.162)   (-0.01, 0.189)   (-0.031, 0.207)
CD approaches
CD with histogram prior       Prior CD    0.020  0.060   0.048  (-0.023, 0.142)  (-0.068, 0.145)  (-0.070, 0.182)
                              Likelihood  0.104  0.104   0.103  (-0.007, 0.214)  (-0.043, 0.251)  (-0.062, 0.269)
                              Comb. CD    0.060  0.065   0.058  (0.013, 0.141)   (-0.022, 0.145)  (-0.025, 0.182)
CD with normal prior          Comb. CD    0.068  0.068   0.068  (0.001, 0.135)   (-0.018, 0.154)  (-0.035, 0.171)


4.2. Skewed priors: A simulation study. The outcome in the previous 
subsection begs the question of whether there would be a significant differ- 
ence among the approaches if the distribution of expert opinions were un- 
ambiguously skewed, so that the normal approximation is clearly not valid. 
Conventional wisdom suggests that full Bayesian approaches based on beta 
priors, though computationally more intensive, would have advantages due 
to their flexibility in capturing distributions of various shapes. The CD
approaches, allowing skewed priors, may also work. However, the numerical
results reveal a surprising finding in the full Bayesian approaches.

Fig. 3. Simulated prior distribution of $\delta = p_1 - p_0$ using BIBETA(6, 20, 2) distribution.

In this simulation study, we again use the observed clinical data on PR2,
but replace Table 1 and Figure 1 of expert opinions with their simulated
counterparts, assuming that the underlying prior distribution function of
$(p_0, p_1)$ is a bivariate beta distribution, BIBETA(6, 20, 2). The marginal
means of the BIBETA(6, 20, 2) distribution are $E p_0 = 6/(6 + 2) = 0.75$ and
$E p_1 = 20/(20 + 2) \approx 0.91$. Thus, the simulated prior represents a treatment
effect improvement on average from about 75% to about 91%, which are similar
to those of the real trial in Section 4.1. Specifically, we simulate responses
of 100 patients for each of the 11 experts from BIBETA(6, 20, 2),
tally the results in the format of Table 1 (not shown), and then plot them
as a histogram in Figure 3. For a direct visual comparison, Figure 3 includes
the curve of the BIBETA(6, 20, 2) density. Also plotted in Figure 3
is, as a common approach to fitting a skewed distribution, the following
fitted log-normal density
$\phi\bigl(\{\log(\delta) - \log(\hat\mu_d - c)\}/\{1 + \hat\sigma_d^2/(\hat\mu_d - c)^2\}^{1/2} - \sqrt{1 + \hat\sigma_d^2/(\hat\mu_d - c)^2}\bigr)/\{\delta\{1 + \hat\sigma_d^2/(\hat\mu_d - c)^2\}^{1/2}\}$.
Here, $\hat\mu_d$ and $\hat\sigma_d$ are the mean
and the standard deviation computed from the histogram, and $c$ is a constant
used to capture the shift of the log-normal distribution from 0.
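
A rough Python sketch of the simulation step follows. It assumes that BIBETA($q_0$, $q_1$, $r$) denotes a bivariate beta built from three independent gamma variables with one shared component (the paper's own definition is in Section 2 and is not reproduced here); this construction has marginal means $q_0/(q_0 + r)$ and $q_1/(q_1 + r)$, matching the values 0.75 and 20/22 quoted above. The bin edges, the seed and the reading of "100 patients per expert" as 100 draws per expert are likewise assumptions made for illustration.

```python
import random

def sample_bibeta(q0, q1, r, rng):
    """One draw of (p0, p1) via a shared-gamma construction (assumed form)."""
    u0 = rng.gammavariate(q0, 1.0)
    u1 = rng.gammavariate(q1, 1.0)
    w = rng.gammavariate(r, 1.0)    # shared component induces correlation
    return u0 / (u0 + w), u1 / (u1 + w)

rng = random.Random(2013)
n_experts, n_patients = 11, 100

# Hypothetical bins of width 0.1 on [-1, 1] for tallying delta = p1 - p0
# (the interval bounds of Table 1 are not reproduced here).
n_bins = 20

tallies = []
all_deltas = []
for _ in range(n_experts):
    counts = [0] * n_bins
    for _ in range(n_patients):
        p0, p1 = sample_bibeta(6.0, 20.0, 2.0, rng)
        delta = p1 - p0
        all_deltas.append(delta)
        k = min(int((delta + 1.0) / 0.1), n_bins - 1)
        counts[k] += 1
    tallies.append(counts)   # one row per simulated expert, summing to 100

mean_delta = sum(all_deltas) / len(all_deltas)
print(round(mean_delta, 3))
```

Each row of `tallies` plays the role of one expert's row in a Table 1 style survey, and averaging the rows gives the pooled histogram.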

We apply the same four full Bayesian approaches used in Section 4.1 to
incorporate the simulated expert opinions represented in Figure 3 with the
clinical trial data on PR2. The four sets of prior parameters used in these four
approaches are $(q_0, r_0, q_1, r_1) = (14.66, 4.88, 46.81, 4.68)$,
$(\alpha_0, \beta_0, \alpha_1, \beta_1) = (30.19, 10.06, 96.43, 9.43)$, $(q_0, q_1, r) = (6, 20, 2)$ and
$(\alpha_0, \alpha_1, \beta) = (17.88, 59.60, 5.96)$, respectively. In the third approach, we directly
use the true set of prior parameters $(q_0, q_1, r) = (6, 20, 2)$; in the other three, the
prior parameters are obtained by the method of moments outlined in Section 2. In the
simulation example, the Metropolis-Hastings algorithm is again iterated t = 25,000
times (burn-in), and it is repeated 1000 times to obtain 1000 independent
Metropolis-Hastings samples of $(p_0, p_1)$ in each case. The acceptance rates
are on average 0.0019, 0.0027, 0.0057 and 0.0036, respectively.
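
For readers unfamiliar with the sampler, here is a minimal random-walk Metropolis sketch for the first of the four priors. It is not the authors' scheme: it assumes the independent beta prior is the product of Beta($q_0$, $r_0$) and Beta($q_1$, $r_1$) densities, uses a Gaussian random-walk proposal with an arbitrary step size, and runs one long chain instead of 1000 short replications. The proposal distribution actually used in the paper is not stated in this section.

```python
import math
import random

# Assumption: independent prior p0 ~ Beta(q0, r0), p1 ~ Beta(q1, r1).
q0, r0, q1, r1 = 14.66, 4.88, 46.81, 4.68
x0, n0, x1, n1 = 31, 68, 33, 59   # PR2 counts from Section 4.1

def log_post(p0, p1):
    """Unnormalized log posterior of (p0, p1) under binomial likelihoods."""
    if not (0.0 < p0 < 1.0 and 0.0 < p1 < 1.0):
        return -math.inf
    return ((q0 + x0 - 1) * math.log(p0) + (r0 + n0 - x0 - 1) * math.log(1 - p0)
            + (q1 + x1 - 1) * math.log(p1) + (r1 + n1 - x1 - 1) * math.log(1 - p1))

def rw_metropolis(n_iter, burn_in, step=0.05, seed=1):
    """Random-walk Metropolis; returns post-burn-in draws of delta = p1 - p0."""
    rng = random.Random(seed)
    p0, p1 = 0.5, 0.5
    cur = log_post(p0, p1)
    deltas, accepted = [], 0
    for it in range(n_iter):
        c0 = p0 + rng.gauss(0.0, step)
        c1 = p1 + rng.gauss(0.0, step)
        cand = log_post(c0, c1)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, cand - cur)):
            p0, p1, cur = c0, c1, cand
            accepted += 1
        if it >= burn_in:
            deltas.append(p1 - p0)
    return deltas, accepted / n_iter

deltas, acc_rate = rw_metropolis(n_iter=30000, burn_in=5000)
post_mean = sum(deltas) / len(deltas)
print(round(post_mean, 3), round(acc_rate, 3))
```

Under the stated prior the posterior is available in closed form, so the chain average can be checked against the exact posterior mean of $\delta$, mirroring the convergence check described in the text.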
We also apply two CD combination approaches to incorporate the simulated
expert opinions in Figure 3 with the clinical trial data on PR2. Similar
to that in Section 4.1, the first CD approach directly uses the raw histogram
in Figure 3. The second CD approach, in order to have a direct comparison
with the Bayesian approach using the underlying prior BIBETA(6, 20, 2),
combines the underlying marginal density function of $\delta$ with the CD from
the clinical trial data. Of course, in reality we do not know the underlying
prior distribution or the underlying marginal density function of $\delta$. Thus,
the second CD approach has only theoretical value. Without relying on the
underlying CD prior, we also consider the CD approach which combines the
fitted log-normal distribution in Figure 3 with the CD from the clinical trial
data. However, since the log-normal curve is evidently a poor fit for the
histogram in Figure 3, the result for this CD approach, though not too far
off, does not seem well justified and is thus omitted.

Figures 4(a)-(d) show the results on the improvement $\delta$ using the full
Bayesian approaches, and Figures 4(e)-(f) show the results using the CD
combination approaches. Figure 4 adopts the same notation and symbols
used in Figure 2. Again, for the independent beta prior, the posterior density
from the algorithm closely matches the one using its exact formula (dashed-dotted
line), indicating that the MCMC chain of the Metropolis-Hastings
algorithm has generally converged in this case. Also, we report in Table 3
the numerical results from the six approaches: the mode, median, mean and
confidence/credible intervals of the marginal priors, the profile likelihood
function and the marginal posteriors of $\delta$.

Fig. 4. Outcomes of data analysis from simulated data with a skewed prior.

Table 3
Numerical results from incorporating the simulated expert opinions (summarized in
Figure 3) with clinical data on PR2: the mode, median, mean, I_80%, I_90% and I_95% of
the marginal prior, (normalized) profile likelihood function, and marginal posterior
of the parameter $\delta$. Here, I_80%, I_90% and I_95% denote the interval [$100\alpha$%-tile,
$100(1 - \alpha)$%-tile] for $\alpha$ = 10%, 5% and 2.5%, respectively. The prior parameters
in the four full Bayesian approaches are $(q_0, r_0, q_1, r_1) = (14.66, 4.88, 46.81, 4.68)$,
$(\alpha_0, \beta_0, \alpha_1, \beta_1) = (30.19, 10.06, 96.43, 9.43)$, $(q_0, q_1, r) = (6, 20, 2)$ and
$(\alpha_0, \alpha_1, \beta) = (17.88, 59.60, 5.96)$, respectively

Approach                        Curve       Mode   Median  Mean   I_80%            I_90%            I_95%
Bayesian approaches
Independent Beta prior          Prior       0.128  0.153   0.159  (0.033, 0.297)   (0.003, 0.340)   (-0.023, 0.379)
                                Likelihood  0.104  0.104   0.103  (-0.007, 0.214)  (-0.043, 0.251)  (-0.062, 0.269)
                                Posterior   0.211  0.212   0.212  (0.120, 0.306)   (0.089, 0.330)   (0.066, 0.346)
Hierarchical Beta prior         Prior       0.145  0.152   0.159  (0.031, 0.295)   (-0.001, 0.337)  (-0.027, 0.375)
                                Likelihood  0.104  0.104   0.103  (-0.007, 0.214)  (-0.043, 0.251)  (-0.062, 0.269)
                                Posterior   0.214  0.212   0.212  (0.122, 0.302)   (0.094, 0.329)   (0.078, 0.348)
Independent Bi-Beta prior       Prior       0.095  0.140   0.159  (0.042, 0.306)   (0.027, 0.360)   (0.017, 0.407)
                                Likelihood  0.104  0.104   0.103  (-0.007, 0.214)  (-0.043, 0.251)  (-0.062, 0.269)
                                Posterior   0.202  0.203   0.201  (0.112, 0.288)   (0.084, 0.315)   (0.061, 0.339)
Hierarchical Bi-Beta prior      Prior       0.120  0.146   0.159  (0.056, 0.281)   (0.039, 0.326)   (0.028, 0.366)
                                Likelihood  0.104  0.104   0.103  (-0.007, 0.214)  (-0.043, 0.251)  (-0.062, 0.269)
                                Posterior   0.232  0.225   0.222  (0.138, 0.305)   (0.116, 0.329)   (0.101, 0.340)
CD approaches
CD with histogram prior         Prior CD    0.075  0.125   0.159  (0.025, 0.275)   (-0.025, 0.325)  (-0.025, 0.375)
                                Likelihood  0.104  0.104   0.103  (-0.007, 0.214)  (-0.043, 0.251)  (-0.062, 0.269)
                                Comb. CD    0.100  0.110   0.118  (0.035, 0.200)   (0.000, 0.225)   (-0.005, 0.250)
CD with marginal Bi-Beta prior  Prior CD    0.095  0.140   0.159  (0.042, 0.306)   (0.027, 0.360)   (0.017, 0.407)
                                Likelihood  0.104  0.104   0.103  (-0.007, 0.214)  (-0.043, 0.251)  (-0.062, 0.269)
                                Comb. CD    0.099  0.099   0.119  (0.026, 0.191)   (0.007, 0.209)   (-0.011, 0.246)

The CD approaches perform exactly as anticipated. However, examining 
the modes of the three curves in each of Figures 4(a)-(d), we notice that the 
mode of the marginal posterior distribution (solid curve) lies to the right 
of the peaks of both the marginal prior distribution (dotted curve) and the 
profile likelihood function (dashed curve). The numerical results in Table 3 
also confirm that the mode, median and mean of the marginal posterior distributions
of $\delta$ from all four full Bayesian approaches are much larger than
their counterparts from the corresponding marginal priors and profile likelihood
functions. This discrepant posterior phenomenon is counterintuitive!
For instance, if we use the means as our point estimators, we would report 
from Figure 4(c) that the experts suggest about 15.9% improvement and 
the clinical evidence suggests about 10.3% improvement but, incorporating 
them together, the overall estimator of the treatment effect is 20.1%, which is 
bigger than either that reported by the experts or that suggested by the clin- 
ical data. This result is certainly not easy to explain to clinicians or general 
practitioners of statistics. In any event, it seems worthwhile to investigate 
further and see what ramifications this intriguing phenomenon may have. 
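
The phenomenon can be reproduced with closed-form arithmetic in the simplest case. The Python sketch below assumes the independent beta prior is the product of Beta($q_0$, $r_0$) and Beta($q_1$, $r_1$) densities with the parameter values quoted in Section 4.2, so that each posterior is again a beta distribution by conjugacy. It compares means only and is not meant to reproduce the entries of Table 3, which come from the authors' own prior specification and sampler.

```python
# Assumption: independent prior p0 ~ Beta(q0, r0), p1 ~ Beta(q1, r1), with the
# parameter values quoted in Section 4.2.
q0, r0, q1, r1 = 14.66, 4.88, 46.81, 4.68
x0, n0, x1, n1 = 31, 68, 33, 59   # PR2 counts from Section 4.1

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

prior_mean = beta_mean(q1, r1) - beta_mean(q0, r0)
data_estimate = x1 / n1 - x0 / n0

# Beta-binomial conjugacy: add successes and failures to the prior parameters.
post_p0 = beta_mean(q0 + x0, r0 + n0 - x0)
post_p1 = beta_mean(q1 + x1, r1 + n1 - x1)
post_mean = post_p1 - post_p0

print(f"prior {prior_mean:.3f}, data {data_estimate:.3f}, posterior {post_mean:.3f}")
```

Each of $p_0$ and $p_1$ has a posterior mean between its prior mean and its sample proportion, yet their difference does not: the control arm is pulled further toward the data than the treatment arm, so the posterior mean of $\delta$ (about 0.20) exceeds both the prior mean (about 0.16) and the data-only estimate (about 0.10).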

To further examine the phenomenon, we compare the percentiles of the 
marginal priors, the profile likelihood function and the marginal posterior 
distributions of the treatment effect $\delta$ in Table 3. In each of the four Bayesian
approaches, the 95% posterior credible interval lies inside the corresponding 
95% interval from the prior and has substantial overlap with the correspond- 
ing 95% interval from the profile likelihood. But this is not always the case at 
the 80% and 90% levels, where several posterior credible intervals do not lie 
within the boundaries of the corresponding intervals from the priors and the 
likelihood functions. The outcome of whether the posterior credible interval 
lies within the boundaries of the other two depends on the choice of the credi- 
ble level. Thus, using credible intervals as our primary inferential instrument 
cannot completely avoid the discrepant posterior phenomenon either. 
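
The dependence on the credible level can be checked mechanically. The Python sketch below takes the three intervals for the independent beta prior from Table 3 and asks, at each level, whether the posterior interval lies within the outer boundaries of the prior and likelihood intervals. The helper name `within_hull` and the choice of this particular row are ours.

```python
# Intervals for the independent beta prior, copied from Table 3.
intervals = {
    "80%": {"prior": (0.033, 0.297), "lik": (-0.007, 0.214), "post": (0.120, 0.306)},
    "90%": {"prior": (0.003, 0.340), "lik": (-0.043, 0.251), "post": (0.089, 0.330)},
    "95%": {"prior": (-0.023, 0.379), "lik": (-0.062, 0.269), "post": (0.066, 0.346)},
}

def within_hull(post, prior, lik):
    """True if the posterior interval lies inside the span of the other two."""
    low = min(prior[0], lik[0])
    high = max(prior[1], lik[1])
    return low <= post[0] and post[1] <= high

result = {level: within_hull(v["post"], v["prior"], v["lik"])
          for level, v in intervals.items()}
print(result)
```

For this row the 80% posterior interval has an upper limit of 0.306, beyond both 0.297 and 0.214, while the 90% and 95% intervals stay inside, so the verdict changes with the level chosen.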

To better understand this phenomenon, we plot in Figures 5(a)-(d) the
contours of the joint prior distribution $\pi(p_0, p_1)$, the likelihood function of
$(p_0, p_1)$ and the joint posterior distribution of $(p_0, p_1)$ for each of the four full
Bayesian approaches. We show that certain projections of Figures 5(a)-(d)
lead to the marginal distributions and plots in Figures 4(a)-(d). As marked
in Figure 5, the center (mode) of each contour plot is on a line $\delta = p_1 - p_0$
(or $p_1 = p_0 + \delta$). Varying $\delta$ in $\delta = p_1 - p_0$ produces a family of parallel lines,
all making a 45° angle with the horizontal axis. The projections of the three
distributions along these parallel lines onto the interval of possible values of
$\delta$, $-1 \le \delta \le 1$, lead to the plots of marginal distributions in Figures 4(a)-(d).
The yellow curves in (a) are a posterior contour plot from the exact formula.
Although the contour plots of the posterior distributions sit between those
of the prior distributions and the likelihood function, their projected peaks
(modes) are more to the upper-left than those of the marginal priors and
the profile likelihood function. Further investigation indicates that this is
a genuine mathematical phenomenon which holds for all four Bayesian approaches
and not merely an aberration due to some special circumstances. In
fact, when the center (mode) of a posterior distribution is not in the interval
joining the two centers (modes) of the joint prior and likelihood functions,
as is often the case with skewed distributions (and even sometimes with
nonskewed distributions), there always exists a linear direction, say, $a p_0 + b p_1$
with some coefficients $a$ and $b$, along which the marginal posterior fails to fall
between the marginal prior and likelihood functions of the same parameter.
Reparametrization, if done carefully, such as considering the joint distribution of
$\delta = p_1 - p_0$ together with $p_1 + p_0$, or others, may sometimes help hide the discrepant
posterior phenomenon in the $\delta$ direction, but cannot eliminate it systematically.
We have found no discussion of such a geometric finding on marginalization
in the Bayesian literature. See further discussion in Section 5.

Fig. 5. Contour plots: joint prior distribution (in blue), joint likelihood function (in
black) and estimated posterior distribution (in red) of $(p_0, p_1)$. These two-dimensional
distributions are projected along the 45° lines of $\delta = p_1 - p_0$ onto the interval of possible
values of $\delta$, $-1 \le \delta \le 1$, leading to Figures 4(a)-(d). The yellow curves in (a) are a
posterior contour plot from the exact formula.
Panels: (a) Independent Beta Prior; (b) Hierarchical Beta Prior; (c) BiBeta Prior; (d) Hierarchical BiBeta Prior.


5. Conclusions and additional remarks. To incorporate expert opinions 
in the analysis of a clinical trial with binary outcomes in a meaningful way, 
we have developed and studied several bivariate full Bayesian approaches as 
well as a CD approach. We show that both the Bayesian and the proposed 
CD approaches may provide viable solutions. Although the paper focuses 
on expert opinions in pharmaceutical studies, the methodologies developed 
here can be applied to incorporating other types of priors or external information,
for example, historical knowledge. These methodologies should also
be useful in many other fields, including finance, social science studies and 
even homeland security, where prior knowledge, expert opinions and histori- 
cal information are much valued and need to be incorporated with observed 
data in an effective and justifiable manner. 

In this paper we have examined and compared both Bayesian and CD 
approaches. Although there does not exist the usual theoretical platform 
for a direct comparison on efficiency or lengths of intervals, the comparison 
can be summarized in three aspects: empirical results, computational effort 
and theoretical consideration. The empirical findings from Figure 2 show 
that, as long as the histogram of the expert opinions can be well approxi- 
mated by a normal distribution, all approaches considered in this paper per- 
form comparably, in terms of the posterior distribution or the combined CD 
and their corresponding inferences. However, if the histogram is skewed, the 
full Bayesian approach may produce the discrepant posterior phenomenon, 
which is difficult to avoid in theory and difficult to explain in applications. 
The CD approach avoids such a phenomenon. 

In terms of the computational effort, the bivariate full Bayesian approach 
is demanding since it requires running a large-scale simulation using an 
MCMC algorithm, while the proposed CD approach is both explicit and 
straightforward to compute. In addition, the CD approach can directly in- 
corporate the histogram of expert opinions without an additional effort of 
curve fitting. 

Theoretically, since it is not possible to find a "marginal" likelihood of $\delta$
[i.e., a conditional density function $f(\mathrm{data}|\delta)$], any univariate Bayesian approach
focusing on the parameter of interest $\delta$ is not supported by Bayesian
theory. A full Bayesian solution is to jointly model $(p_0, p_1)$ [or a reparameterization
of the pair $(p_0, p_1)$] and, subsequently, make inferences using the
marginal posterior of $\delta$. The full Bayesian approaches developed in the paper
follow exactly this procedure and are theoretically sound. The proposed CD
approach is developed strictly under the frequentist paradigm and is also
theoretically sound. Unlike the full Bayesian approaches, the CD approach
can focus directly on the parameter of interest $\delta$ without the additional burden
of modeling other parameters or the correlation between $p_0$ and $p_1$, and
thus appears to have some advantage in this application.

A surprising finding in this research is the discrepant posterior phenomenon 
occurring in the full Bayesian approaches under skewed priors. Although it 
may be mitigated if the prior is only slightly skewed or is in accordance 
with the likelihood function, the phenomenon is intrinsically mathematical. 
How much skewness is required to produce the phenomenon depends on all 
elements involved, including shapes and locations of both the likelihood and 
the prior. The reactions to this phenomenon we have encountered thus far 
fall roughly into two groups. One group views the discrepant posterior as 
a mathematical truth and, if one has faith in the choice of the prior, one 
should proceed to make inference using this marginal posterior, even though
the outcome is counterintuitive. The other group worries about the
counterintuitive result and would try to find alternative approaches with
good operating characteristics for the particular problem at hand, even at
the cost of abandoning the mathematically solid full Bayesian approach in
favor of less rigorous approaches such as the univariate Bayesian approach
described in Section 1.
In any case, the lesson learned from the Bayesian analysis here is that the 
choice of the prior really matters and it needs to be in some agreement with 
the likelihood function, which is similar in spirit to what was referred to as 
"model dependent" in Berger (2006). We also consider this a manifestation 
of an inherent difficulty in modeling accurately the joint effects of the two 
treatments as reflected in $p_0$ and $p_1$ and their correlation. This difficulty
illustrates again the complexity of the practice of incorporating external 
information in trials with binary outcomes. 

The discrepant posterior phenomenon is caused by "marginalization," but
it is different from the "marginalization paradox" discussed in Dawid, Stone
and Zidek (1973) and Berger (2006). In particular, the marginalization
paradox in Dawid, Stone and Zidek (1973) refers to the phenomenon that the
marginal posterior $\pi(\theta \mid \text{data})$ obtained from the joint
prior $\pi(\theta, \phi)$ and its full model
$f(\text{data} \mid \theta, \phi)$ can sometimes be quite different
("incoherent") from the posterior $\pi(\theta \mid \text{data})$ obtained by
applying the Bayes formula directly to its marginal prior $\pi(\theta)$ and
marginal model $f(\text{data} \mid \theta)$, even though the marginal prior
$\pi(\theta)$ and marginal model $f(\text{data} \mid \theta)$ are consistent
("coherent") with the joint prior $\pi(\theta, \phi)$ and the full model
$f(\text{data} \mid \theta, \phi)$. Here, $\phi$ represents nuisance
parameters. This paradox is different from what we observed here. In our
example, it is not possible to have the marginal model
$f(\text{data} \mid \theta)$, and the discrepant posterior phenomenon in the
full Bayesian approach is that the estimate derived from the marginal
posterior $\pi(\theta \mid \text{data})$ may not be between the estimates
from the marginal prior $\pi(\theta)$ and the profile likelihood function
$\ell(\theta \mid \text{data})$. This is counterintuitive in practical
applications.
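
In symbols, the discrepant posterior phenomenon can be stated as follows. The
hatted quantities and subscripts are notation introduced here for
illustration; the definitions of the marginal posterior and the profile
likelihood are the standard ones.

```latex
\[
\pi(\theta \mid \text{data}) = \int \pi(\theta, \phi \mid \text{data})\, d\phi,
\qquad
\ell(\theta \mid \text{data}) = \sup_{\phi} f(\text{data} \mid \theta, \phi),
\]
\[
\hat\theta_{\mathrm{post}} \notin
\bigl[\min(\hat\theta_{\mathrm{prior}}, \hat\theta_{\mathrm{lik}}),\;
      \max(\hat\theta_{\mathrm{prior}}, \hat\theta_{\mathrm{lik}})\bigr],
\]
where $\hat\theta_{\mathrm{prior}}$, $\hat\theta_{\mathrm{lik}}$ and
$\hat\theta_{\mathrm{post}}$ denote the point estimates (e.g., modes) derived
from $\pi(\theta)$, $\ell(\theta \mid \text{data})$ and
$\pi(\theta \mid \text{data})$, respectively.
```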

It is worth noting that the discussion and implications of the discrepant 
posterior phenomenon extend beyond the setting of binary outcomes to any 
multivariate setting involving skewed distributions. As long as the center 
(mode) of a posterior distribution is not in the interval joining the centers 
(modes) of the joint prior and the likelihood function, there always exists 
a direction along which the center (mode) of the marginal posterior fails 
to fall between the centers (modes) of marginal prior and the profile likeli- 
hood function. This phenomenon has implications in the general practice of 
Bayesian analysis. For instance, many researchers in machine learning and
other fields routinely draw conclusions based solely on marginal posterior
distributions without checking the validity of such conclusions, a check
that can itself be very difficult. The discrepant posterior phenomenon
suggests that
further care is needed.

Many methods have been introduced to model "reported" expert opinions,
account for their potential errors and heterogeneity, and subsequently
pool them; see Genest and Zidek (1986) for an excellent review of this topic. 
In particular, Spiegelhalter, Freedman and Parmar (1994) described "arith- 
metic and logarithm pooling" as the "two simplest methods" for pooling 
expert opinions, and articulated a "strong preference" "for arithmetic pool- 
ing to obtain an estimated opinion of a typical participating clinician." The 
underlying assumption of arithmetic pooling is that the average of the "ob- 
served" expert opinions is an unbiased representation of the "true" prior 
knowledge. This assumption naturally facilitates the additive error model 
used in Section 3.2 for summarizing "reported" expert opinions in a CD. 
Clearly, the modeling principle and development we used to summarize the 
expert opinions in a CD are similar in spirit to those discussed in Genest and 
Zidek (1986) and Spiegelhalter, Freedman and Parmar (1994) for Bayesian 
approaches. 
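
To make the two pooling rules concrete, the following sketch evaluates an
arithmetic (linear) pool and a logarithmic pool of expert densities on a
grid. It is a generic illustration of these two rules, not the additive
error model of Section 3.2; the three normal-shaped expert opinions, the
equal weights and all function names are hypothetical choices made here.

```python
import numpy as np

def integrate(y, x):
    """Trapezoidal rule on the grid x."""
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

def normal_pdf(x, mean, sd):
    """Normal density, used only to mimic an expert's opinion."""
    return np.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def arithmetic_pool(densities, weights):
    """Weighted average of the expert densities (linear opinion pool)."""
    return np.asarray(weights) @ np.asarray(densities)

def logarithmic_pool(densities, weights, grid):
    """Weighted geometric mean of the densities, renormalized on the grid."""
    log_dens = np.log(np.maximum(np.asarray(densities), 1e-300))  # avoid log(0)
    log_pool = np.asarray(weights) @ log_dens
    pooled = np.exp(log_pool - log_pool.max())  # stabilize before normalizing
    return pooled / integrate(pooled, grid)

# Hypothetical opinions of three experts about the improvement delta.
grid = np.linspace(-1.0, 1.0, 2001)
expert_means = [0.05, 0.10, 0.30]
expert_sd = 0.10
densities = [normal_pdf(grid, m, expert_sd) for m in expert_means]
weights = np.full(len(expert_means), 1.0 / len(expert_means))

pooled_arith = arithmetic_pool(densities, weights)
pooled_log = logarithmic_pool(densities, weights, grid)
```

With equal-variance normal opinions and equal weights, the logarithmic pool
is again a single normal curve centered at the average of the expert means,
whereas the arithmetic pool keeps the mixture shape and thus preserves
disagreement among experts.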

The modeling framework developed in Section 3.2 is sufficiently flexible 
and can be modified to accommodate various ways of aggregating expert 
opinions. In particular, it can incorporate weighting schemes to develop a 
robust method against extreme expert opinions, introduce additional terms 
to reflect biased opinions or additional uncertainties, or use the geometric 
mean as a way to pool the expert opinions. Some of these extensions (e.g., the 
robust method) by themselves could be attractive choices to produce priors 
in the context of traditional Bayesian approaches. Due to space limitations, 
we will not pursue these extensions in this paper. 

In a different direction, we have also considered modeling the survey data 
of expert opinions using a traditional random effects approach. In such a 
model, we provide a regression model for the responses of the 100 "virtual 
patients" of each expert (as described in Section 1.1) and add a random 
effect term to account for the expert-to-expert variation. However, it seems 
nontrivial to overcome the technical difficulty in making the modeling pro- 
cess free of the number (100) of "virtual patients." In fact, this difficulty led 
us to the bootstrap argument in Section 3.2, in which we mimic a potential 
model of expert exposure to pre-existing experiments. Clearly, there remain 
many challenging issues in modeling the survey data of expert opinions, even 
for the seemingly simple binary setting.

SUPPLEMENTARY MATERIAL
Appendix: MCMC algorithm and CD examples
(DOI: 10.1214/12-AOAS585SUPP; .pdf). Appendix I contains a
Metropolis-Hastings algorithm used in Section 2. Appendix II presents two
CD examples that are relevant to the exposition of this paper.